Chunking data inserted into knowledge base#195

Merged
ea-rus merged 1 commit into main from kb-insert-chunking on May 12, 2025
@ea-rus (Collaborator) commented on May 12, 2025

Updated the insert method of the knowledge base:

  • If the data is a list or a DataFrame, it is split into chunks sized by the MAX_INSERT_SIZE setting (currently 1000)

This is done to optimise resource usage on both the client and the server.
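
For illustration, the chunked insert described above can be sketched roughly as follows. This is a minimal sketch, not the SDK's actual insert method: `insert_chunked` and the `send` callback are hypothetical names standing in for whatever issues one request per chunk, and `split_data` is copied from this PR's diff but restricted to lists so the example has no pandas dependency.

```python
from typing import Callable, Iterable, List

MAX_INSERT_SIZE = 1000  # value quoted in the PR description


def split_data(data: list, partition_size: int) -> Iterable[list]:
    """Yield consecutive slices of `data`, each at most `partition_size` long."""
    num = 0
    while num * partition_size < len(data):
        yield data[num * partition_size: (num + 1) * partition_size]
        num += 1


def insert_chunked(data: list, send: Callable[[list], None]) -> int:
    """Hypothetical helper: pass each chunk to `send`, return the number of calls."""
    calls = 0
    for chunk in split_data(data, MAX_INSERT_SIZE):
        send(chunk)  # in a real client this would be one insert request
        calls += 1
    return calls


# Record the size of every chunk that would be sent.
sent_sizes: List[int] = []
requests = insert_chunked(list(range(2500)), lambda chunk: sent_sizes.append(len(chunk)))
print(requests, sent_sizes)  # 3 [1000, 1000, 500]
```

With 2500 items and a chunk size of 1000, the data goes out in three pieces instead of one large payload, which is where the resource saving would come from.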

Comment on lines +22 to +30
def split_data(data: Union[pd.DataFrame, list], partition_size: int) -> Iterable:
    """
    Split data into chunks with partition_size and yield them out
    """
    num = 0
    while num * partition_size < len(data):
        # create results with partition
        yield data[num * partition_size: (num + 1) * partition_size]
        num += 1

Correctness: The split_data function doesn't handle empty datasets, which could lead to an infinite loop if an empty DataFrame or list is passed to it.

📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
-def split_data(data: Union[pd.DataFrame, list], partition_size: int) -> Iterable:
-    """
-    Split data into chunks with partition_size and yield them out
-    """
-    num = 0
-    while num * partition_size < len(data):
-        # create results with partition
-        yield data[num * partition_size: (num + 1) * partition_size]
-        num += 1
+def split_data(data: Union[pd.DataFrame, list], partition_size: int) -> Iterable:
+    """
+    Split data into chunks with partition_size and yield them out
+    """
+    if len(data) == 0:
+        return
+    num = 0
+    while num * partition_size < len(data):
+        # create results with partition
+        yield data[num * partition_size: (num + 1) * partition_size]
+        num += 1

The AI is wrong here: if data is empty, len(data) is 0, so the loop condition is false on the first check and the loop body is never executed.
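
The point above can be checked directly. The snippet below is a small sketch that reuses the loop from the diff, restricted to lists so it runs without pandas (an assumption made for the example only), and materialises the generator for a few edge cases.

```python
from typing import Iterable


def split_data(data: list, partition_size: int) -> Iterable[list]:
    """Same loop as in the PR, typed for lists only to avoid a pandas import."""
    num = 0
    while num * partition_size < len(data):
        yield data[num * partition_size: (num + 1) * partition_size]
        num += 1


# Empty input: 0 * 1000 < 0 is False, so nothing is yielded and the call returns.
empty_chunks = list(split_data([], 1000))
# Input shorter than the chunk size: a single chunk holding everything.
small_chunks = list(split_data([1, 2, 3], 1000))
# Length is an exact multiple of the chunk size: no trailing empty chunk.
exact_chunks = list(split_data([1, 2, 3, 4], 2))

print(empty_chunks)  # []
print(small_chunks)  # [[1, 2, 3]]
print(exact_chunks)  # [[1, 2], [3, 4]]
```

The generator terminates on empty input without the extra `if len(data) == 0` guard, so the suggested change would not alter behaviour.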

@github-actions

Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| mindsdb_sdk/agents.py | 211 | 56 | 73% | 33, 101, 104, 107, 110, 118, 126, 146, 167, 178, 181, 185, 187, 189, 191, 193, 195, 257, 270, 281, 292, 296–311, 323, 332–336, 343–344, 392, 400–401, 454, 503–504, 507, 514, 535–537, 541–545 |
| mindsdb_sdk/databases.py | 45 | 2 | 96% | 109, 137 |
| mindsdb_sdk/handlers.py | 39 | 1 | 97% | 77 |
| mindsdb_sdk/jobs.py | 97 | 7 | 93% | 40, 52, 80, 84, 146–149 |
| mindsdb_sdk/knowledge_bases.py | 134 | 15 | 89% | 66–69, 135, 161, 190, 197, 201–203, 207, 229–233, 243 |
| mindsdb_sdk/ml_engines.py | 42 | 3 | 93% | 94, 126, 128 |
| mindsdb_sdk/models.py | 210 | 19 | 91% | 109, 140–141, 222, 231, 233, 303, 339, 348, 372, 397, 491, 499, 518, 534, 542, 567, 571, 584 |
| mindsdb_sdk/projects.py | 63 | 1 | 98% | 160 |
| mindsdb_sdk/query.py | 13 | 1 | 92% | 14 |
| mindsdb_sdk/skills.py | 53 | 3 | 94% | 43, 45, 49 |
| mindsdb_sdk/tables.py | 130 | 15 | 88% | 140–142, 145, 165, 192, 203–204, 209, 224, 227, 321, 342–347, 356, 376 |
| mindsdb_sdk/views.py | 37 | 2 | 95% | 105, 138 |
| mindsdb_sdk/connectors/rest_api.py | 255 | 52 | 80% | 19–29, 35–36, 51, 55, 58–59, 79–81, 102, 105, 112–115, 148–156, 177–178, 213–216, 230–231, 285–290, 294–306 |
| mindsdb_sdk/utils/agents.py | 50 | 4 | 92% | 72, 79–81 |
| mindsdb_sdk/utils/mind.py | 47 | 47 | 0% | 1–128 |
| mindsdb_sdk/utils/openai.py | 85 | 30 | 65% | 37–40, 83–85, 107, 148–158, 215–216, 234–240, 258–276 |
| mindsdb_sdk/utils/table_schema.py | 21 | 21 | 0% | 1–54 |
| TOTAL | 1619 | 279 | 83% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 28 | 0 💤 | 0 ❌ | 0 🔥 | 11.389s ⏱️ |

@fshabashev left a comment

LGTM

@github-project-automation bot moved this from "to review" to "approved" in Tracking PRs on May 12, 2025
@ea-rus merged commit f629eb5 into main on May 12, 2025
7 checks passed
@ea-rus deleted the kb-insert-chunking branch on May 12, 2025 14:22
@github-project-automation bot moved this from "approved" to "merged" in Tracking PRs on May 12, 2025
@github-actions bot locked and limited conversation to collaborators on May 12, 2025